Frontiers in Digital Health
Frontiers Media SA
Preprints posted in the last 90 days, ranked by how well they match Frontiers in Digital Health's content profile, based on 20 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.
Francis, A. J. A.; Raza, A.; Patel, N.; Gajbhiye, R.; Kumar, V.; T, A.; Saikia, A.; Mibang, O.; K, V.; Joshi, K.; Tony, L.; Balasubramani, P. P.
The rapid growth of tele-counseling and the use of lay counselors in high-volume, low-resource mental health services have created a need for scalable tools for early detection and triage. Effective personalization now requires stratifying individuals by dominant symptom profiles, such as appetite, agency, anxiety, and sleep disturbances. Depression symptoms vary widely, even among those with similar scores, reflecting distinct psychophysiological and cognitive-affective patterns. In tele-mental-health settings, where contextual cues are limited, multimodal behavioral signals from natural interactions can complement traditional assessments. Using synchronized audio, video, and text data from the EDAIC dataset (N=275), we propose a multimodal learning framework to classify five clinically validated outcomes: Depression, Appetite disturbance, Agency impairment, Anxiety, and Sleep problems. We developed a comprehensive multimodal machine-learning pipeline, incorporating automated dataset construction, modality-specific feature extraction (acoustic, facial action unit, linguistic), and supervised learning with cross-validation. Labels were derived from validated scoring rules to ensure clinical relevance. Sentiment analysis revealed lower sentiment scores in participants with high Depression, Anxiety, or Agency scores, but no significant differences in Appetite or Sleep severity. Model performance was assessed across three scenarios: text (transcripts), phone calls (audio + transcript), and video calls (audio + video + transcript). Temporal models (CNN+BiLSTM) achieved over 65% accuracy across modalities, while a fine-tuned temporal model for depression detection using video calls reached an accuracy of 81% with an F1-score of 0.79, demonstrating that our approach performs on par with state-of-the-art methods. XGBoost excelled in phone and video calls, while Ridge classifiers performed best for text-based inputs. Shapley-value (SHAP) analysis identified key audio and video features for detecting Depression and other symptoms. A translational avatar-based interface validated system operability, demonstrating the potential for scalable, objective mental-health assessment in tele-counseling.
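As a hedged sketch of the temporal model named above, the following Keras snippet builds a CNN+BiLSTM binary classifier over fused multimodal feature sequences; the sequence length, feature dimension, and layer sizes are our assumptions, not the authors' configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

T, F = 300, 64  # assumed: 300 time steps of 64 fused multimodal features
model = tf.keras.Sequential([
    tf.keras.Input(shape=(T, F)),
    layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # e.g. Depression vs. no Depression
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Synthetic stand-in data, only to demonstrate the training call.
X = np.random.rand(32, T, F).astype("float32")
y = np.random.randint(0, 2, size=(32,)).astype("float32")
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
```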
Basharat, A.; Hamza, O.; Rana, P.; Odonkor, C. A.; Chow, R.
Introduction: Large language models are increasingly being used in healthcare. In interventional pain medicine, clinical reasoning is essential for procedural planning. Prior studies show that simplified prompts reduce clinical detail in AI-generated responses. It remains unclear whether this reflects knowledge loss or simply prompt-driven suppression of information. Methods: We performed a controlled comparative study using 15 standardized low back pain questions representing common interventional pain questions. Each question was submitted to ChatGPT under three conditions: professional-level prompt (DP), fourth-grade reading-level prompt (D4), and clinician-directed rewriting of the D4 response to a medical level (U4→MD). No follow-up prompting was allowed. Three physicians independently rated responses for accuracy using a 0-2 ordinal scale. Clinical completeness was determined by consensus. Word count and Flesch-Kincaid Grade Level (FKGL) were also measured. Paired t-tests compared conditions. Results: Accuracy was highest with professional prompting (1.76). Accuracy declined with the fourth-grade prompt (1.33; p = 0.00086). When simplified responses were rewritten for clinicians, accuracy returned to baseline (1.76; p ≈ 1.00 vs DP). Clinical completeness followed the same pattern: DP 80.0%, D4 6.7%, U4→MD 73.3%. Fourth-grade responses were shorter and less complex. Upscaled responses were more complex and similar in length to professional responses. Inter-rater reliability was low (Fleiss κ = 0.17), but trends were consistent across conditions. Conclusions: Reduced clinical detail under simplified prompts appears to reflect constrained output rather than loss of knowledge. Clinician-directed reframing restores omitted content. LLM performance in interventional pain depends strongly on prompt design and intended audience.
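The statistical comparison is a standard paired t-test; a minimal SciPy sketch with invented per-question ratings standing in for the study's data:

```python
# Paired comparison of per-question accuracy ratings under two prompt
# conditions. Scores below are synthetic placeholders on the 0-2 scale.
from scipy import stats

dp = [2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2]  # professional prompt
d4 = [1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2]  # fourth-grade prompt

t, p = stats.ttest_rel(dp, d4)
print(f"paired t = {t:.3f}, p = {p:.5f}")
```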
Shankar, R.; Goh, A.; Xu, Q.
Background: The administrative burden of clinical documentation is a recognised contributor to clinician burnout and diminished care quality. Ambient artificial intelligence (AI) scribe technology, which uses large language models to passively record and summarise clinical encounters, has rapidly gained traction internationally. However, no published studies have examined clinician experiences with this technology in the Asia-Pacific region or within Singapore's multilingual healthcare system. Objective: This study explored clinician perspectives on ambient AI scribe technology at Alexandra Hospital, Singapore, focusing on perceived benefits, barriers, workflow integration, ethical considerations, and recommendations for sustained implementation. Methods: A qualitative descriptive study was conducted using semi-structured interviews with 28 clinicians across multiple specialties at Alexandra Hospital, National University Health System (NUHS). Participants were purposively sampled for diversity in role, specialty, and usage level. Interviews were analysed using reflexive thematic analysis guided by the RE-AIM/PRISM framework. The COREQ checklist was followed. Results: Five themes emerged: (1) reclaiming presence in the clinical encounter, (2) navigating accuracy and trust in AI-generated documentation, (3) workflow disruption and adaptation, (4) privacy, consent, and ethical tensions within Singapore's regulatory landscape, and (5) envisioning sustainable integration. Clinicians reported improved patient engagement and reduced cognitive burden. Persistent barriers included accuracy concerns, AI hallucinations, limited multilingual functionality, loss of documentation style, and uncertainties around compliance with the Personal Data Protection Act (PDPA). Conclusions: Ambient AI scribe technology holds promise for alleviating documentation burden in Singapore's public healthcare system. Realising this potential requires attention to safety validation, multilingual capability, clinician training, and patient-centred consent aligned with local regulatory frameworks.
Zhang, K.; Zhao, Z.; Hu, Y.; Le, T.
Objective: To evaluate the effectiveness of various Large Language Models (LLMs) in identifying reliable predictors of Electronic Nicotine Delivery Systems (ENDS) initiation among adolescents, using solely large-scale survey variable descriptions. Methods: A cohort of 7,943 tobacco-naive adolescents aged 12-16 years from the Population Assessment of Tobacco and Health (PATH) Study was analyzed to predict ENDS use at wave 5. Four instruction-tuned LLMs - GPT-4o, LLaMA 3.1-70B, Qwen 2.5-72B-Instruct, and DeepSeek-V3 - were systematically evaluated for text-based feature selection using only variable descriptions from wave 4.5. Selected features were used to train LightGBM classifiers, with model performance compared to a baseline. Results: Our findings reveal notable consistency among the four instruction-tuned LLMs, with substantial overlap in the top predictors each model identified. These selected variables spanned critical domains such as peer and household influence, risk perception, and exposure to tobacco-related cues. LightGBM classifiers trained on PATH wave 4.5-5 data using features selected by the LLMs demonstrated strong predictive performance. Notably, Qwen 2.5-72B-Instruct achieved an AUC of 0.791 with 30 predictors, surpassing the baseline AUC of 0.768. Discussion: The substantial overlap among the top predictors identified by different LLMs suggests a shared reasoning process, despite variations in model architecture and training. LightGBM classifiers trained on these LLM-selected features achieved performance comparable to, or exceeding, models trained on the full set of survey variables, underscoring the high quality of features selected solely from textual descriptions. Moreover, these findings are consistent with previous tobacco regulatory research, further validating the effectiveness of LLM-driven feature selection. Conclusion: Instruction-tuned large language models can effectively perform text-based feature selection using survey variable descriptions alone, without accessing raw survey data. This scalable, interpretable, and privacy-preserving framework holds promise for behavioral health research and tobacco use surveillance.
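The downstream modeling step follows a common pattern; a hedged sketch of training a LightGBM classifier on an assumed set of 30 LLM-selected predictors, with synthetic data standing in for the PATH variables:

```python
# Train/evaluate pattern only; feature matrix and labels are random
# stand-ins, not PATH data, and 30 is the predictor count cited above.
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.random((1000, 30))        # assumed: 30 LLM-selected predictors
y = rng.integers(0, 2, 1000)      # synthetic ENDS-initiation label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```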
Vollam, S.; Roman, C.; King, E.; Tarassenko, L.
A Wearable Monitoring System (WMS), comprising a chest patch, wrist-worn pulse oximeter, and arm-worn blood pressure device, was developed in preparation for a pilot Randomised Controlled Trial (RCT) on a UK surgical ward. The system was designed to support continuous physiological monitoring and early detection of deterioration. An initial prototype user interface was developed by the research team based on prior clinical experience and engineering knowledge. To ensure suitability for clinical practice, iterative user-centred refinement was undertaken through a series of clinician focus groups and wearability assessments. Six focus groups were conducted between November 2019 and May 2021 involving multidisciplinary healthcare professionals. Feedback from these sessions informed successive interface and system modifications. System development spanned the COVID-19 pandemic, during which the WMS was rapidly adapted and deployed to support clinical care on isolation wards. Feedback obtained during this period was incorporated into later versions of the system and provided a unique opportunity to examine changes in clinician priorities under pandemic conditions. Clinicians consistently prioritised alert visibility, alarm fatigue mitigation, parameter flexibility, and centralised monitoring. Notably, preferences regarding alert modality and access mechanisms evolved over time: early enthusiasm for mobile or smartphone-type devices shifted towards a preference for fixed, ward-based displays and audible alerts at the nurses' station following pandemic deployment. Building on previous wearability testing in healthy volunteers, wearability testing using a validated questionnaire was completed by 169 patient participants during the RCT. The chest patch and pulse oximeter demonstrated high tolerability, whereas the blood pressure cuff showed poor wearability and was removed from the final system. These findings demonstrate the importance of iterative, clinician-led design for wearable monitoring systems and highlight how extreme clinical contexts such as the COVID-19 pandemic can significantly reshape perceived requirements for safety-critical monitoring technologies.
Dai, H.-J.; Fang, L.-C.; Mir, T. H.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.
Objectives: Publicly available datasets dedicated to clinical speech de-identification tasks remain scarce due to privacy constraints and the complexity of speech-level annotation. To address this gap, we compiled the SREDH-AICup sensitive health information (SHI) speech corpus, a time-aligned clinical speech dataset annotated across 38 SHI categories. Methods: Two publicly available English medical-domain datasets were adapted to support speech-level de-identification, including script reformulation and controlled re-recording by 25 participants. Additional Mandarin Chinese clinical-style materials were incorporated to extend linguistic coverage. All audio data were annotated with millisecond-level, time-aligned SHI spans using Label Studio. Inter-annotator agreement was evaluated using Cohen's kappa, following iterative calibration rounds. The resulting corpus supports both automatic speech recognition (ASR) and speech-level recognition of SHIs. Results: The final dataset comprises 20 hours of annotated audio, divided into training (10 hours, 1,539 files), validation (5 hours, 775 files), and test (5 hours, 710 files) subsets, totalling 7,830 SHI entities. The language distribution reflects the composition of the selected source materials, with 19.36 hours of English and 0.89 hours of Mandarin Chinese speech. Discussion: The corpus exhibits a long-tail distribution consistent with clinical documentation patterns and highlights the limited availability of Chinese medical speech resources. These characteristics underscore both the realism of the dataset and the structural challenges associated with multilingual speech de-identification. Conclusion: The SREDH-AICup SHI speech corpus provides a clinically grounded, time-aligned speech dataset supporting automated medical speech de-identification research and facilitating future development of multilingual speech-based privacy protection systems.
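The inter-annotator agreement step uses a standard statistic; a minimal scikit-learn sketch on invented span labels from two annotators:

```python
# Cohen's kappa between two annotators' per-span SHI labels.
# Label sequences are invented for illustration; "O" marks non-SHI spans.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["NAME", "DATE", "O", "LOC", "O", "NAME", "O", "DATE"]
annotator_b = ["NAME", "DATE", "O", "O",   "O", "NAME", "O", "DATE"]

print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))
```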
Chowdhury, A.; Irtiza, A.
Background: Urgent care departments in Europe face a structural paradox: accelerating digitalisation is accompanied by a patient population that is disproportionately unable to engage with standard digital tools. An internal analysis at the Emergency Department (Akutafdelingen) of Nordsjællands Hospital in Hillerød, Denmark found that 43% of emergency patients struggle with digital solutions - a figure that reflects the predictable composition of acute care populations rather than any individual failing. Objective: This paper presents the design, iterative development, and secondary validation of the ED Adaptive Interface (v5): a prototype adaptive patient terminal developed in response to this challenge. The system operationalises what the authors term impairment-first design - a methodology that treats the most constrained patient experience as the primary design problem and derives the standard experience as a subset. The interface configures itself in under ten seconds via nurse-led setup, adapting across four axes of impairment: visual, motor, speech, and cognitive. System: Version 4 supports five accessibility modes, a heatmap pain assessment grid, a Privacy and Dignity panel, a live workflow tracker with care notifications, structured dual-category help requests, and plain-language medical term definitions across four languages. Version 5, reported here for the first time, introduces a Condition Worsening Escalation button, a Referral Pathway Display, a "Why Am I Waiting?" triage explainer, a Symptom Progression Log, MinSP/Yellow Card Scan simulation, expanded language support (seven languages: English, Danish, Arabic with full RTL layout, Turkish, Romanian, Polish, and Somali), and an expanded ten-item Communication Board. The entire system runs as a single 79-kilobyte HTML file with zero infrastructure requirements. Methods: To base the design on patient-generated evidence, two independent social media threads were subjected to an inductive thematic analysis (Braun and Clarke, 2006): a primary corpus of 83 entries in the Facebook group Foreigners in Denmark (collected March 2026) and a corroborating corpus in an international community group in the Aarhus region (collected April 2026). All identifiers in both datasets were fully anonymised under GDPR Article 89 research provisions prior to analysis. No participants were contacted. Generative AI tools were used to assist with drafting, writing, and prototype code development; all scientific content, data collection, analysis, and conclusions are the sole responsibility of the authors. Results: The first discourse corpus produced five major themes corresponding to the five problem areas the prototype was designed to address: system navigation and triage literacy gaps (31 entries); language and cultural barriers (6 entries); communication failures during care (5 entries); staff overload and capacity constraints (8 entries); and pain and severity assessment failures (14 entries). The corroborating dataset supported all five themes and introduced two additional themes: differential treatment of international patients and medical gaslighting as a long-term pattern of patient advocacy failure. One structural finding - the five most-liked comments incorrectly criticised the original poster for self-referring when she had received explicit 1813 telephone triage approval - directly inspired the Referral Pathway Display and "Why Am I Waiting?" features in v5.
Conclusions: The convergence of design rationale and independent social evidence across all five problem categories suggests that impairment-first design is not a niche accessibility concern but a structural approach to healthcare interface quality. The prototype is ready for a structured clinical pilot using the System Usability Scale (SUS) and semi-structured staff interviews. The long-term roadmap includes full MinSP integration, hospital PMS connectivity, and clinical validation.
Haq, I. U.; Sirica, D.; Wheelock, V. L.; Benedict, R.; Sarno, M. L.; Tjaden, K.; Ozelius, L.; Firth, R.; Napoli, E.; Sweadner, K.; Brashear, A.
ATP1A3-related syndromes represent a continuously expanding clinical spectrum and present with an extraordinarily wide range of symptoms. New phenotypes continue to emerge, posing ongoing challenges for both diagnosis and development of treatments. In this context, telemedicine offers a unique opportunity to greatly expand outreach to patients. Remote, high-resolution assessments help refine phenotypic characterization and the identification of novel and intermediate phenotypes. In this study we aimed to determine completion rates and practicality of conducting motor, speech, and neuropsychological assessments entirely via virtual visits. Although the broader recruitment included several ATP1A3-related disorders, this virtual battery was specifically developed for subjects with RDP. Participants with other ATP1A3 phenotypes enrolled in the study contributed to evaluating the overall feasibility of the workflow but were not the target population for the full battery. We recruited individuals with suspected or confirmed diagnosis of ATP1A3-related disorders, along with familial controls, from three participating clinical sites. Participants completed all study procedures through scheduled telemedicine visits using their personal devices (tablets, laptops, smartphones). A total of 59 participants were enrolled, including 46 individuals with suspected or confirmed ATP1A3 variants and 13 family member controls. Among affected patients, 18 had RDP, 12 AHC (Alternating Hemiplegia of Childhood), 4 CAPOS (Cerebellar ataxia, Areflexia, Pes cavus, Optic atrophy, Sensorineural hearing loss), 10 were categorized as "uncertain" and 2 with "mixed" phenotype (RDP/CAPOS and RDP/AHC). The virtual assessment battery included: (i) patient history questionnaire (PHQ), (ii) structured neurological examination adapted for virtual visits, (iii) speech recording, and (iv) extensive neuropsychological assessment. Feasibility was evaluated based on completion rates for each assessment component. Remote neurological, speech and neurocognitive/psychiatric assessments were completed by most participants with ATP1A3 phenotypes, with completion rates of 78% for motor examination and 87% for speech evaluation. The observed patterns of motor and speech impairments were consistent with prior in-person evaluations, supporting the validity and feasibility of remote assessment of both motor and nonmotor features in complex genetic neurological disease.
Chowdhury, A.; Irtiza, A.
The 1.8 million residents of Region Hovedstaden (Denmark's Capital Region) currently lack a secure, standardized pathway for integrating continuous wearable health data into Sundhed.dk, the national electronic health record. Consumer wearables such as Apple Watch, Oura Ring, and Garmin generate longitudinal physiological data relevant to chronic disease management, yet existing workflows rely on manual, non-standardized exports incompatible with FHIR DK v6.0.2 profiles and GDPR Article 25 privacy-by-design requirements. This paper presents a conceptual five-layer microservice architecture for secure wearable data sharing, employing MitID national authentication, National Service Infrastructure (NSI) integration, and Zero Trust security controls. Requirements were derived from a mixed-methods study including surveys of 47 Danish stakeholders and systematic benchmarking of existing platforms. Results show 51.1% conditional willingness to share wearable data under secure conditions, with audit transparency and non-medical misuse identified as central trust factors. Fourteen MoSCoW-prioritized requirements (F1-F7, NF1-NF7) are mapped to architecture components, providing a traceable blueprint for closing the interoperability gap in Danish public healthcare.
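As a rough illustration of the kind of payload such an architecture would exchange, here is a generic FHIR Observation for a wearable heart-rate reading built as a plain Python dict; all values are invented, and conformance with the FHIR DK v6.0.2 profiles cited above would require additional profile-specific elements.

```python
# Generic FHIR R4 Observation for a wearable heart-rate measurement.
# LOINC 8867-4 is the standard heart-rate code; other values are examples.
import json

observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org",
                         "code": "8867-4", "display": "Heart rate"}]},
    "subject": {"reference": "Patient/example"},
    "effectiveDateTime": "2025-06-01T07:30:00+02:00",
    "valueQuantity": {"value": 62, "unit": "beats/min",
                      "system": "http://unitsofmeasure.org", "code": "/min"},
    "device": {"display": "Consumer wearable (illustrative)"},
}
print(json.dumps(observation, indent=2))
```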
Li, Y.; Zhou, H.; Blackley, S.; Plasek, J. M.; Lyu, Z.; Zhang, W.; You, J.; Centi, A.; Mishuris, R.; Yang, J.; Zhou, L.
Ambient intelligence-based systems are increasingly used for clinical documentation. To quantify linguistic differences associated with ambient documentation, we conducted a matched pre-post analysis of 6,026 outpatient clinical notes from Mass General Brigham following implementation of two ambient AI documentation systems (Nuance Dragon Ambient eXperience [DAX] and Abridge). Within-clinician comparisons focused on the History of Present Illness (HPI) and Assessment and Plan (A&P) sections and evaluated syntactic complexity, lexical ambiguity, linguistic variability, discourse coherence, and readability. Manual review of 50 paired notes was performed to validate findings from automated linguistic analyses. Our analyses indicate that the linguistic effects of ambient documentation are both vendor-dependent and section-specific. Across both vendors, ambient notes in HPI were longer and exhibited greater syntactic complexity (longer sentences and clauses, increased dependency distance), lower lexical ambiguity, lower language-model perplexity, and higher local and global discourse coherence. These findings indicate that ambient systems systematically restructure conversational input into more syntactically elaborated and linguistically predictable narratives, reflecting increased standardization relative to both general-domain and biomedical language models. In contrast, changes in A&P were smaller and more heterogeneous, consistent with its more structured/templated nature. Readability analyses further showed increased length and lexical complexity in ambient HPI, whereas A&P readability differences were minimal. Overall, our findings demonstrate that ambient documentation changes how clinical information is linguistically expressed and organized, with effects varying by note section, vendor, and provider role/specialty. Evaluation should therefore extend beyond efficiency to consider effects on communication, cognitive load, clinical inference, and downstream analytics.
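Two of the measures named above, readability and dependency distance, have straightforward open-source implementations; a sketch using textstat and spaCy (requires the en_core_web_sm model; the example sentence is invented and this is not the authors' pipeline):

```python
# Flesch-Kincaid grade level via textstat, plus a simple mean
# dependency distance (token-to-head offset) via spaCy.
import textstat
import spacy

note = ("The patient is a 64-year-old man who presents with three days of "
        "progressive exertional dyspnea and bilateral lower-extremity edema.")

print("FKGL:", textstat.flesch_kincaid_grade(note))

nlp = spacy.load("en_core_web_sm")
doc = nlp(note)
dists = [abs(tok.i - tok.head.i) for tok in doc if tok.head is not tok]
print("mean dependency distance:", sum(dists) / len(dists))
```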
Kim, J. E.; Holbrook, E. B.; Hron, J. D.; Parsons, C. R.
Background: Conversational AI safety systems are primarily evaluated using message-level content monitoring, which assesses inputs and outputs in isolation. This message-by-message approach can miss interaction-level risks that emerge over extended conversations, including patterns discussed in reports of "AI psychosis." Critically, by the time users express overt psychosis-spectrum content, opportunities for intervention may be limited. Objective: We investigated whether LLM responses gradually expand and connect interpretations beyond the user's original concerns, a process we term structural drift. We also tested whether this drift can be detected early and automatically. Methods: We developed an automated, LLM-adapted rubric-based prompt for seven domains of anomalous (psychosis-spectrum) experience, derived from phenomenological psychiatry to capture subtle shifts in subjective interpretation. In Part 1, we evaluated the rubric using gold-standard text excerpts (N = 484) adapted from clinically validated qualitative instruments. In Part 2, we analyzed 1,290 user-LLM response exchanges from 7 dialogues, using 3 different LLMs (5 repeats each), to measure (i) domain amplification (increasing score within a domain) and (ii) domain expansion (new domains appearing over time). Results: Automated scoring showed strong agreement with gold-standard excerpts (domain accuracy 82.7-98.9%; exact 0-3 agreement 63.6-82.7%). Across dialogues, we observed significant amplification in four domains (p < .05; d = 0.14-0.46) and domain expansion in 83.8% of dialogues (88/105; p < .001). Conclusions: AI responses can systematically expand and intensify users' descriptions beyond their initial input. Taken together with predictive-processing accounts of psychosis, these findings suggest that the exposure itself may reinforce maladaptive inferences. Because drift is detectable from ordinary dialogue without clinical-style probing, structural drift detection may support scalable, real-time monitoring for emerging risks before overt escalation.
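The two drift signals defined above are simple to operationalize once per-turn rubric scores exist; a sketch with synthetic scores (a real pipeline would obtain them from the LLM-adapted rubric scorer, and the domain names here are illustrative):

```python
# Amplification: rising 0-3 rubric score within a domain across turns.
# Expansion: turns that introduce a previously unseen domain.
from scipy import stats

# turn -> {domain: rubric score}, invented for illustration
turns = [
    {"self_disturbance": 1},
    {"self_disturbance": 1, "salience": 1},
    {"self_disturbance": 2, "salience": 1},
    {"self_disturbance": 3, "salience": 2, "embodiment": 1},
]

# Amplification as rank correlation between turn index and score.
scores = [t.get("self_disturbance", 0) for t in turns]
rho, p = stats.spearmanr(range(len(scores)), scores)
print(f"amplification rho = {rho:.2f} (p = {p:.3f})")

# Count turns that add a domain not seen before.
seen, expansions = set(), 0
for t in turns:
    if seen and set(t) - seen:
        expansions += 1
    seen |= set(t)
print("expansion events:", expansions)
```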
Amewudah, P.; Popescu, M.; Farmer, M. S.; Powell, K. R.
Background: Secure text messages (TMs) exchanged among interdisciplinary care teams in nursing homes (NHs) contain clinical information that aligns with the Age-Friendly Health Systems 4Ms: What Matters, Medication, Mentation, and Mobility; yet this information is not captured in any structured form, making it unavailable for systematic monitoring or quality reporting. Automatically extracting 4M information accurately and efficiently from these messages could enable several downstream applications within long-term care settings. This task, however, is challenging because of the fragmented syntax, brevity, abbreviations, and informality of TMs. Objective: This study aimed to develop and evaluate a multi-stage 4M Entity Recognition (4M-ER) pipeline that combines a fine-tuned token classifier with large language model (LLM) revision, using only locally deployed open-source models, to improve 4M information extraction from clinical TMs. Methods: We used an expert-annotated dataset of 1,169 TMs collected from interdisciplinary teams across 16 Midwest NHs. The pipeline first identifies candidate text spans using a fine-tuned Bio-ClinicalBERT token classifier. A semantic similarity retriever then selects in-context exemplars to guide an LLM revision step in which the LLM (Gemma, Phi, Qwen, or Mistral) performs boundary correction, label evaluation, and selective acceptance or rejection of candidate spans. Baselines for comparison included single-stage zero-shot LLMs, single-stage fine-tuned Bio-ClinicalBERT, and a fine-tuned LLM (Gemma) from a prior study. Ablation studies assessed the contribution of each pipeline stage and the effect of message filtering. Robustness was evaluated across 5 repeated runs. Results: The 4M-ER pipeline outperformed the previously fine-tuned Gemma LLM across all 4M domains, achieving F1 (entity type) improvements of +2 to +11 percentage points without any additional fine-tuning and at roughly half the GPU memory (12 vs 24 GB). It also improved upon single-stage fine-tuned Bio-ClinicalBERT in Mobility, Mentation, and What Matters (+0.02 to +0.05 F1). Error analysis showed that LLM revision reduced false positives by 25% to 35% by correcting misclassifications caused by conversational ambiguity, while the fine-tuned Bio-ClinicalBERT's high recall captured subtle entities that the fine-tuned Gemma missed. Silver data augmentation further improved the hardest domains, raising What Matters F1 from 0.59 to 0.67 and Mobility from 0.64 to 0.67. Ablation studies confirmed that restricting LLMs to revision only yielded optimal accuracy and efficiency. Conclusions: The 4M-ER pipeline enables accurate and scalable extraction of 4M entities from clinical TMs by combining fine-tuned Bio-ClinicalBERT with LLM revision using only locally deployed open-source models. The structured 4M data produced by the pipeline can support 4M taxonomy and ontology construction, as demonstrated in prior work, and provides a foundation for downstream applications including real-time clinical surveillance, compliance with emerging age-friendly quality measures, and predictive modeling in long-term care settings.
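Structurally, the two-stage pipeline reduces to "propose, then revise"; a skeleton with both stages stubbed out (the function names and the example message are ours, not the authors' code):

```python
# Stage 1 proposes candidate (span, 4M label) pairs; stage 2 asks an LLM
# to fix boundaries and accept or reject each candidate.
def propose_spans(text):
    # Stand-in for the fine-tuned Bio-ClinicalBERT token classifier.
    return [("ambulating in hallway", "Mobility"), ("PT eval", "Mobility")]

def llm_revise(text, span, label):
    # Stand-in for the LLM revision step; a real implementation would
    # prompt a local model with retrieved in-context exemplars and
    # parse its accept/relabel/reject verdict.
    return {"span": span, "label": label, "accept": True}

message = "Res ambulating in hallway w/ walker, PT eval ordered."
final = [llm_revise(message, s, l) for s, l in propose_spans(message)]
print([r for r in final if r["accept"]])
```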
Dai, H.-J.; Mir, T. H.; Fang, L.-C.; Chen, C.-T.; Feng, H.-H.; Lai, J.-R.; Hsu, H.-C.; Nandy, P.; Panchal, O.; Liao, W.-H.; Tien, Y.-Z.; Chen, P.-Z.; Lin, Y.-R.; Jonnagaddala, J.
Accurate recognition and de-identification of sensitive health information (SHI) in spoken dialogues requires multimodal algorithms that can understand medical language and contextual nuance; errors at either stage risk exposing SHI. Additionally, the variability and complexity of medical terminology, along with the inherent biases in medical datasets, further complicate this task. This study introduces the SREDH/AI-Cup 2025 Medical Speech Sensitive Information Recognition Challenge, which comprises two tasks: Task 1, speech transcription, in which systems must accurately transcribe speech into text; and Task 2, medical speech de-identification, in which systems must detect and appropriately classify mentions of SHI. The competition attracted 246 teams; top-performing systems achieved a mixed error rate (MER) of 0.1147 and a macro F1-score of 0.7103, with average MER and macro F1-score of 0.3539 and 0.2696, respectively. Results were presented at the IW-DMRN workshop in 2025. Notably, LLMs were prevalent across both tasks: 97.5% of teams adopted LLMs for Task 1 and 100% for Task 2, highlighting their growing role in healthcare. Furthermore, we fine-tuned six models, demonstrating strong precision (~0.885-0.889) with slightly lower recall (~0.830-0.847), resulting in F1-scores of 0.857-0.867.
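As a hedged illustration of the Task 2 ranking metric, here is a token-level macro F1 computed with scikit-learn; the labels are invented, and the challenge's actual evaluation scores SHI mentions at the entity level, which this stand-in simplifies.

```python
# Macro F1 averages per-class F1 equally, so rare SHI classes
# count as much as frequent ones. Token-level simplification.
from sklearn.metrics import f1_score

gold = ["NAME", "DATE", "O", "LOC", "O", "NAME"]
pred = ["NAME", "O",    "O", "LOC", "O", "DATE"]
print("macro F1:", f1_score(gold, pred, average="macro"))
```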
Yousaf, M. N.; Anwar, M. N.; Naveed, N.; Haider, U.
Background: Tinnitus affects a substantial proportion of the global population and can severely disrupt sleep, mood, and daily functioning, yet the quality of mobile health apps designed for tinnitus management remains highly variable. Traditional evaluation methods, including clinical trials, expert rating scales, and small-scale surveys, rarely capture large-scale, feature-level feedback from real-world users, leaving a gap in understanding which app characteristics drive sustained engagement and satisfaction. Methods: This study analysed 342,520 English-language reviews from 84 tinnitus-related apps on iOS and Android collected between 2015 and 2025. A pipeline first applied VADER-based preprocessing and sentiment assignment, then trained a graph neural network aspect-based sentiment analysis (GNN-ABSA) model operating on sentence-level dependency graphs to infer feature-level sentiment for domains such as sound therapy, sleep support, pricing, advertisements, stability, and user interface. Results: The GNN-ABSA model achieved an accuracy of 84.4% and a macro F1 score of 0.829 on unseen aspect-level test data, indicating stable performance across sentiment classes. Therapeutic features like sound masking and sleep support were associated with predominantly positive sentiment, whereas pricing, advertisements, background playback, and technical stability attracted more neutral or negative feedback over the ten-year period. Conclusions: Large-scale, graph-based feature-level sentiment analysis provides a user-centred perspective that complements clinical trials and expert app quality ratings, offering actionable guidance for developers seeking to prioritize design improvements, supporting clinicians in recommending suitable apps to patients, and informing the design of more explainable and user-driven digital health tools. Trial Registration: Not applicable. This study analysed publicly available app store reviews and did not involve human participants.
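The VADER stage of such a pipeline is nearly a one-liner with the vaderSentiment package; a sketch on an invented review:

```python
# VADER assigns negative/neutral/positive/compound scores per text;
# the review string is invented for illustration.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
review = "The sound masking helps me sleep, but the ads are constant."
print(analyzer.polarity_scores(review))  # e.g. {'neg': ..., 'compound': ...}
```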
Kim, S.; Guo, Y.; Sutari, S.; Chow, E.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.
Social determinants of health (SDoH) are important for clinical care, but it remains unclear how much AI-captured social context is preserved after clinician editing in ambient documentation workflows. We retrospectively analyzed 75,133 paired ambient AI-drafted and clinician-finalized note sections from ambulatory care at a large academic health system. Using a rule-based NLP pipeline, we extracted 21 SDoH categories and quantified retention, deletion, and addition. SDoH appeared in 25.2% of AI drafts versus 17.2% of final notes. At the mention level, AI captured 29,991 SDoH mentions, of which 45.1% were deleted and 54.9% retained; clinicians added 3,583 new mentions. Insurance and marital status were most often deleted, whereas substance use and physical activity were more often retained. Deletion patterns also varied by specialty, supporting the need for specialty-aware ambient AI systems.
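The mention-level bookkeeping described above amounts to set comparisons between extractions from the draft and the final note; a sketch with a keyword-matching stub in place of the rule-based pipeline (categories and note texts are invented):

```python
# Retained = in both draft and final; deleted = draft only;
# added = final only. The extractor here is a toy stand-in.
def extract_sdoh(text):
    keywords = {"insurance": "Insurance", "married": "Marital status",
                "tobacco": "Substance use", "walks": "Physical activity"}
    return {label for kw, label in keywords.items() if kw in text.lower()}

draft = "Patient is married, has commercial insurance, denies tobacco use."
final = "Patient denies tobacco use. Walks 30 minutes daily."

d, f = extract_sdoh(draft), extract_sdoh(final)
print("retained:", d & f)   # kept by the clinician
print("deleted:",  d - f)   # removed during editing
print("added:",    f - d)   # introduced by the clinician
```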
Coleman, T.; Mello, M.; Kazanjian, R.; Kazanjian, M.; Olsen, D.; Coleman, J.; Menna, J.
Frequent blood testing is a routine but burdensome reality for many children, particularly those with chronic, rare, or medically complex conditions. Repeated clinic, hospital, and laboratory visits can disrupt family life, increase stress for children and caregivers, and limit access to timely monitoring and research participation. Despite advances in pediatric care, blood collection has remained largely tethered to in-person clinical settings. This study validates a new model: safe, effective, parent-administered pediatric blood collection performed at home. We evaluated the RedDrop ONE capillary blood collection device in a real-world, parent-administered home setting to determine whether non-clinical caregivers can reliably collect clinically meaningful blood samples from children without venipuncture, specialized training, or in-clinic support. Conducted under Institutional Review Board (IRB) oversight, this observational usability study enrolled 50 children aged 3-17 years across a geographically diverse U.S.-based pediatric population, including healthy and medically fragile children with chronic autoimmune and rare diseases. All study activities, including enrollment, consent, instruction, collection, and sample return, were completed remotely, reflecting real-world adoption conditions rather than controlled clinical environments. Parents successfully collected blood samples from their children at home with high consistency, low perceived pain, and strong overall acceptance. Across collections, blood and serum volumes were sufficient and reproducible, and laboratory analysis confirmed strong analytical concordance between samples collected from two different anatomical sites, arm and leg. Parents reported high confidence using the device, short collection times, and a high likelihood of completing collections on the first attempt. Importantly, both parents and children rated the overall experience as better than expected, and parents consistently reported that the RedDrop ONE experience was superior to traditional finger-prick and needle-based venous blood draws. Parents reported minimal child discomfort and greater flexibility by avoiding in-clinic phlebotomy visits. These benefits are especially meaningful for families managing chronic or rare pediatric conditions that require repeated blood monitoring. By enabling blood collection at home, this model reduces travel burden, scheduling constraints, and procedural anxiety while maintaining analytical reliability. This study also demonstrated that parent-administered pediatric blood collection can support real-world clinical workflows beyond research. All samples were successfully shipped overnight at ambient temperature and processed by a CLIA-certified laboratory, supporting feasibility for remote pediatric patient monitoring and decentralized clinical trials. While lipid testing served as the representative clinical use case, the volumes and consistency achieved exceeded volume thresholds commonly required for advanced downstream applications, including proteomics, metabolomics, transcriptomics, and genomic analyses. Taken together, these findings validate parent-administered pediatric blood collection as a practical, scalable alternative to in-clinic phlebotomy for many use cases.
By shifting blood collection from the clinic to the home, this approach has the potential to reduce reliance on in-person phlebotomy, integrate seamlessly into routine pediatric care, and expand access to monitoring and research for families who face geographic, logistical, or medical barriers. For health systems, researchers, and parents alike, this study supports a future in which clinically meaningful pediatric blood collection is no longer limited by healthcare facility location but instead centered on the child and family.
Cremin, C.; Elavalli, S.; Paulin, L.; Arres Reche, J.; Saad, A. A. Y. A.; Attia, A.; Minas, C.; Aldhuhoori, F.; Katagi, G.; Wu, H.; Sidahmed, H.; Mafofo, J.; Soliman, O.; Behl, S.; Pariyachery, S.; Gupta, V.; Ghanem, D.; Sajjad, H.; Cardoso, T.; El-Khani, A.; Al Marzooqi, F.; Magalhaes, T.; Sedlazeck, F. J.; Quilez, J.
Background: The hyperpolymorphic nature and structural complexity of the human leukocyte antigen (HLA) genomic region present challenges for accurate and scalable typing across diverse sample types. While whole-genome sequencing (WGS) offers the opportunity to infer HLA genotypes without targeted enrichment, systematic benchmarks across sequencing platforms, biospecimens and coverage levels remain limited. Results: We assembled a multi-platform resource of WGS datasets derived from short-read (Illumina, MGI) and long-read (Oxford Nanopore Technologies R9 and R10) sequencing, spanning 29 biospecimens including cell lines, blood, buccal swab and saliva. We evaluated the performance of the HLA caller HLA*LA across 13 HLA genes, using a clinically validated assay as reference. WGS-based HLA genotyping achieved ~95% accuracy across sequencing platforms, with Class I loci exhibiting higher accuracy than Class II. Cross-platform concordance was high, and performance remained consistent across Illumina, MGI and Oxford Nanopore chemistries. Analysis of blood, buccal swab and saliva samples showed that blood and buccal swabs supported accurate HLA inference, whereas saliva yielded reduced concordance. Downsampling experiments demonstrated that 15x coverage was sufficient to retain >95% accuracy at two-field resolution, with lower depths supporting lower-resolution typing. Conclusions: Our results demonstrate that WGS provides a robust, platform-agnostic framework for accurate HLA genotyping across sample types and coverage levels. These benchmarks establish practical conditions for reliable HLA inference and underscore the utility of WGS for population-scale HLA analyses and future clinical applications.
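The two-field concordance check implied above can be sketched in a few lines; the helper and example alleles below are ours, and real callers emit richer output, but truncate-and-compare is the core of the metric. Unphased genotypes are compared as multisets so allele order does not matter.

```python
# Per-locus concordance of WGS-inferred HLA calls against a reference
# assay at two-field resolution; alleles are illustrative.
from collections import Counter

def two_field(allele):            # "A*02:01:01:01" -> "A*02:01"
    gene, fields = allele.split("*")
    return gene + "*" + ":".join(fields.split(":")[:2])

reference = {"HLA-A": ["A*02:01:01", "A*24:02:01"],
             "HLA-B": ["B*07:02:01", "B*08:01:01"]}
wgs_call  = {"HLA-A": ["A*02:01",    "A*24:02"],
             "HLA-B": ["B*07:02",    "B*44:02"]}

for locus in reference:
    ref  = Counter(two_field(a) for a in reference[locus])
    call = Counter(two_field(a) for a in wgs_call[locus])
    correct = sum((ref & call).values())   # multiset intersection
    print(locus, f"{correct}/2 alleles concordant")
```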
Bauer, M. P.; van Tol, E. M.; Constansia, T. K. M.; King, L.; van Buchem, M. M.
Background: Typing in the electronic health record (EHR) takes up healthcare providers' time and cognitive space and constitutes a substantial administrative burden contributing to high burnout rates in healthcare. Ambient digital scribes may improve this problem. Objective: To investigate the effect of the use of Autoscriber, an ambient digital scribe, on healthcare providers' administrative workload and the quality of medical notes in the EHR. Methods: A study period of 26 weeks was randomized into weeks when healthcare providers were allowed to use Autoscriber (intervention weeks) and weeks when they were not (control weeks) in a 2:1 ratio. Workload was assessed by comparing the number of characters typed in the medical note during control weeks with the number of modifications that were made to the summary produced by Autoscriber during intervention weeks. Quality of the medical note was measured by having a large language model (LLM) count the number of hallucinations, incorrect negations, context conflation errors, speculations, other inaccuracies, omissions, succinctness errors, organization errors and terminology errors per medical note. Results: Between 1 November 2024 and 30 April 2025, 35 healthcare providers from 14 different specialties recorded 387 consultations in intervention weeks, and 142 in control weeks. The median number of characters typed per medical note was 1079 in control weeks and the median number of modifications necessary to produce the medical note was 351 in intervention weeks, compatible with a lower workload. All types of errors occurred significantly less frequently in notes made with the support of Autoscriber than in those without, except for speculations, where the difference did not reach statistical significance. Conclusions: The use of Autoscriber resulted in a lower workload and a higher quality of the medical note.
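One plausible way to operationalize the "number of modifications" metric is a character-level diff; the sketch below uses Python's difflib on invented draft and final notes and is our illustration, not the study's actual computation.

```python
# Count characters touched by non-equal edit operations between an
# AI-drafted summary and the provider's final note.
import difflib

draft = "Patient reports mild headache since Monday. No visual symptoms."
final = "Patient reports severe headache since Monday. No visual or focal symptoms."

ops = difflib.SequenceMatcher(None, draft, final).get_opcodes()
modified = sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in ops if tag != "equal")
print("characters modified:", modified)
```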
Flathers, M.; Nguyen, P. A. H.; Herpertz, J.; Granof, M.; Ryan, S. J.; Wentworth, L.; Moutier, C. Y.; Torous, J.
Background: Millions of people use language models to discuss mental health concerns, including suicidal ideation, but limited frameworks exist for evaluating whether these systems respond safely. Benchmarking, the practice of administering standardized assessments to language models, offers direct parallels to clinical competency evaluation, yet few clinicians are involved in designing, validating, or interpreting these assessments. Aims: To introduce mental health professionals to benchmarking language models by administering a validated clinical instrument and demonstrating how configuration decisions, measurement limitations, and scoring context affect result interpretation. Method: We administered the Suicide Intervention Response Inventory (SIRI-2) programmatically to nine commercially available language models from three providers. Each item was presented 60 times per model (three prompt variants × two temperature settings × 10 repetitions), yielding 27,000 model responses compared against point-in-time expert consensus. Results: Total scores ranged from 19.5 to 84.0 (expert panel baseline: 32.5). Prompt design alone shifted individual model scores by as much as the difference between trained and untrained human groups. The best-performing model approached the instrument's measurement floor. All nine models consistently overrated clinically inappropriate responses that sounded supportive. Conclusions: A single benchmark score can support markedly different claims depending on the assumed standard of clinical behavior, the instrument's remaining measurement range, and the configuration that produced the result. The skills required to make these distinctions must become core competencies. Benchmark results are increasingly used to support claims about mental health safety that may not be accurate, making it necessary to close the gap between clinical measurement and AI evaluation. Plain Language Summary: AI chatbots like ChatGPT, Claude, and Gemini are increasingly used by millions of people to discuss mental health problems, including thoughts of suicide. To assess whether these systems handle such conversations safely, researchers give them standardized tests called benchmarks and compare their answers to those of human experts. These scores are already used to argue AI systems are ready for clinical use. This study gave a well-established test of suicide response skills to nine AI models from three major companies under varying conditions. We changed how much instruction the AI received and how much randomness was built into its responses, then measured whether the scores changed. The same AI model could score like a trained crisis counselor under one set of conditions and like an untrained undergraduate under another, depending on choices the person running the test made. Every model also made the same kind of mistake: responses that sounded warm and caring were rated as appropriate, even when experts had judged them to be clinically problematic. The highest-scoring model performed so well that the test could no longer measure whether it was truly skilled or had simply exceeded the test's range. These findings show that a single score can be misleading without knowing how the test was run, whether it can still distinguish strong from weak performance, and whether it matches what the AI is used for. Mental health professionals routinely make these judgments about clinical assessments and are well positioned to bring that expertise to AI evaluation.
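The administration protocol is a simple factorial loop; a structural sketch with the model call stubbed out (item names and prompt-variant labels are placeholders, not the study's materials):

```python
# Every item crossed with prompt variants, temperatures, and repetitions:
# 3 x 2 x 10 = 60 responses per item per model, as in the abstract.
import itertools

items = ["item_01", "item_02"]             # stand-ins for SIRI-2 items
prompts = ["minimal", "role", "detailed"]  # three assumed prompt variants
temperatures = [0.0, 1.0]
repetitions = range(10)

def ask_model(item, prompt, temperature):
    return 0.0                              # stub: the model's item rating

responses = [(i, p, t, r, ask_model(i, p, t)) for i, p, t, r in
             itertools.product(items, prompts, temperatures, repetitions)]
print(len(responses), "stubbed responses for", len(items), "items")
```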
Stanwyck, C.; Adibi, A.; Dozie-Nnamah, P.; Alsentzer, E.
Background: Large language models (LLMs) are increasingly piloted as chat interfaces for chart review and clinical decision support. Although leading models achieve and even exceed physician-level accuracy on exam-style benchmarks such as MedQA, recent perturbation studies show large drops in accuracy after small changes to prompts, distractor content, or answer format. Prior work has not systematically examined how these vulnerabilities unintentionally manifest in clinically realistic settings, including multi-turn chatbot interactions, free-text response formats, and tasks involving patient medical records. Methods: We evaluated susceptibility to bias from prior chat messages across 14 LLMs (10 closed-source, 4 open-source) on two medical question-answering tasks: a boards-style benchmark (1000 MedQA test questions) and an electronic health record (EHR) information retrieval task (962 EHRNoteQA questions about real patient discharge summaries). Using a factorial design, we independently varied the presence and type of prior-chat distractors and response format across these two tasks. Distractors ranged from simple statements of incorrect answers to more realistic conversational exchanges between user and model, including interactions referencing a different patient. Findings: Prior-chat distractors produced large and consistent accuracy decrements in the MedQA multiple-choice setting, particularly when the prior message stated an incorrect answer. In this setting, insertion of this user message led to significant accuracy decreases in 13 of 14 models, with drops averaging 15.0 percentage points across models. Effects were smaller for more plausible, conversational distractors and in free-response formats. In contrast, prior-chat bias in the discharge summary-based task was modest and inconsistent. Average accuracy decreases were under 2 percentage points across all distractor types and response formats assessed, with significant effects observed in a minority of models. Interpretation: LLM performance can be biased toward incorrect answers by plausible prior-chat distractors, but these effects are highly context-dependent. We find that distraction effects are common and often substantial in the boards-style multiple-choice task, particularly when the distractor is an explicit (and unrealistic) prior message containing an incorrect answer. In contrast, these effects are markedly attenuated when the same questions are posed in free-response format and the distractor is incorporated into a clinically realistic user-model exchange in the chat history, or when the task is switched from a boards-style vignette to a question about a real (de-identified) patient record. Taken together, these results suggest that evaluations based solely on single-turn, boards-style multiple-choice questions with unrealistic distractors may overstate the impact of prior-chat bias. These findings highlight the need to assess LLM behavior in multi-turn settings involving realistic clinical use cases, rather than relying on boards-style benchmarks for assessment of safety risks.
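The core manipulation, inserting a distractor turn into the chat history ahead of the target question, can be sketched as follows; the message text is invented and only the OpenAI-style messages structure is assumed.

```python
# Build a chat history where an earlier exchange states an incorrect
# answer before the target question is asked.
def with_distractor(question, distractor_answer):
    return [
        {"role": "user", "content": "Quick check on an earlier case."},
        {"role": "assistant", "content": f"The answer is {distractor_answer}."},
        {"role": "user", "content": question},
    ]

msgs = with_distractor("A 54-year-old presents with... Which diagnosis?", "(B)")
for m in msgs:
    print(m["role"], ":", m["content"][:60])
```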